Methodology


The steps the code takes to solve large, sparse, banded systems of
equations are outlined in Table 1. The solver uses either Gaussian
elimination or Cholesky decomposition. Figure 1 is a schematic flow
chart of the solver. A complete solution has three steps: (1)
factorization, which requires updating data in local node memory, block
broadcast of data to other nodes, block receive of data from other
nodes, and updating local data with the received data; (2) forward
elimination, which requires updating the variables local to the node
and broadcasting update information to other nodes; (3) backward
substitution, which requires solving for the variables local to each
node and sending update information to other nodes. Multiple right-hand
sides can be solved with a single factorization.
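
The three steps can be illustrated with a minimal serial sketch in
Python. It is not the CM5E code: there is no node distribution or
message passing, the Cholesky variant is assumed, and the band storage
layout and the names band_cholesky, forward_eliminate and
back_substitute are hypothetical choices made only for this example. It
does show how one factorization is reused for several right-hand sides.

```python
import math

def band_cholesky(a, b):
    """Factor a symmetric positive definite banded matrix as A = L L^T.

    Assumed layout: a[i][d] holds A[i][i-d] for d = 0..b (d = 0 is the
    diagonal). Entries that would fall outside the matrix are ignored.
    Returns L in the same layout.
    """
    n = len(a)
    l = [[0.0] * (b + 1) for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - b), i + 1):
            s = a[i][i - j]
            # Only columns inside the band of row i contribute.
            for k in range(max(0, i - b), j):
                s -= l[i][i - k] * l[j][j - k]
            if j == i:
                l[i][0] = math.sqrt(s)
            else:
                l[i][i - j] = s / l[j][0]
    return l

def forward_eliminate(l, b, rhs):
    """Solve L y = rhs."""
    n = len(l)
    y = [0.0] * n
    for i in range(n):
        s = rhs[i]
        for k in range(max(0, i - b), i):
            s -= l[i][i - k] * y[k]
        y[i] = s / l[i][0]
    return y

def back_substitute(l, b, y):
    """Solve L^T x = y."""
    n = len(l)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = y[i]
        for k in range(i + 1, min(n, i + b + 1)):
            s -= l[k][k - i] * x[k]
        x[i] = s / l[i][0]
    return x

# Tridiagonal test matrix (2 on the diagonal, -1 beside it), half bandwidth 1.
n, b = 5, 1
a = [[2.0, -1.0] for _ in range(n)]
l = band_cholesky(a, b)  # factor once
for rhs in ([0.0, 0.0, 0.0, 0.0, 6.0], [1.0, 0.0, 0.0, 0.0, 1.0]):
    x = back_substitute(l, b, forward_eliminate(l, b, rhs))  # reuse L
    print([round(v, 6) for v in x])
```

For this small test matrix the two solutions should come out close to
(1, 2, 3, 4, 5) and (1, 1, 1, 1, 1). The inner loops run over at most b
terms, which is what keeps the cost of a banded solve far below that of
a dense one.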

Implementation of this algorithm on the CM5E posed several problems.
The most serious were the limitations imposed by CM FORTRAN. It is
desirable to have the matrix data distributed equally to the nodes in a
block round-robin fashion in order to optimize the usage of the four
vector pipelines per node. As sketched in Figure 2, the CM FORTRAN
compiler distributes a vector of N elements (a row of the matrix in
this algorithm) uniformly to the four vector pipelines as N/4, N/4, N/4
and N/4. Operating on elements held in any one pipeline in effect
disables the other pipelines, so vector performance is very low. As
sketched in Figure 3, redefining the vector as a two-dimensional array
with the second dimension parallel solved this problem. The speedup
over the "natural" data distribution was as high as a factor of 80.
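
The layout problem can be pictured with a toy model in Python. The
model assumes each pipeline receives one contiguous block of the
distributed axis; the text states only that the split is uniform, so
the contiguous mapping and the function names pipes_1d and pipes_2d are
assumptions made for illustration. It counts how many pipelines hold
the first 16 elements under each declaration and says nothing about
actual CM FORTRAN timing.

```python
N_PIPES = 4  # vector pipelines per CM5E node, as stated in the text

def pipes_1d(n, touched):
    """Pipelines holding the touched elements of a length-n vector,
    assuming n/4 consecutive elements per pipeline."""
    chunk = n // N_PIPES
    return {i // chunk for i in touched}

def pipes_2d(cols, touched_cols):
    """Pipelines holding the touched columns of a (rows, cols) array,
    assuming only axis 2 is spread across the pipelines."""
    chunk = cols // N_PIPES
    return {j // chunk for j in touched_cols}

first16 = range(16)
print(len(pipes_1d(256, first16)))  # A(256): first 16 elements
print(len(pipes_2d(16, first16)))   # A(16,16): first 16 along axis 2
```

Under these assumptions the one-dimensional declaration keeps the first
16 elements in a single pipeline, while the reshaped array spreads them
over all four.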
Results


A series of test problems has been run to measure the performance of
the code. The problems are: (A) 10,000 equations with an average half
bandwidth of 500; (B) 50,000 equations with an average half bandwidth
of 1000; (C) 100,000 equations with an average half bandwidth of 4000.
The results are presented in the table below. In this table (*) denotes
an estimated value.


Problem   Number of    Run Time    Processing Rate
          CM5 Nodes    (Seconds)   (Megaflops)

A               1         21.0
                4         10.1
               16          6.5
               64          5.5
              128          5.5

B               1        *1666          *30
               64           76          657
              128           65          760
              256           59          847

C               1       *40000          *40
               64         1047         1538
              128          723         2222
              256          681         2349

Small problems, such as (A), are communication bound, and no
improvement in performance occurs with more than 4 to 8 nodes.
Medium-size problems, such as (B), can effectively use up to 64 nodes,
with modest improvement for additional nodes; again communication cost
limits performance, here to just under a Gigaflop. Large problems, such
as (C), can effectively use 128 nodes, with modest improvement for 256
nodes; communication dominates for large numbers of nodes, but
performance of 2.2 to 2.3 Gigaflops is achieved.
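
The scaling behavior described above can be checked with a short
calculation of speedup and parallel efficiency from the run times in
the results table. This is only a sketch: the one-node times for (B)
and (C) are the estimated values marked (*), and the single column of
values reported for problem (A) is assumed here to be run time in
seconds. The helper name scaling is hypothetical.

```python
# (nodes, run time in seconds) copied from the results table.
# The 1-node entries for B and C are the estimates marked (*);
# the values for A are ASSUMED to be run times.
runs = {
    "A": [(1, 21.0), (4, 10.1), (16, 6.5), (64, 5.5), (128, 5.5)],
    "B": [(1, 1666.0), (64, 76.0), (128, 65.0), (256, 59.0)],
    "C": [(1, 40000.0), (64, 1047.0), (128, 723.0), (256, 681.0)],
}

def scaling(rows):
    """Return (nodes, speedup, efficiency) relative to the 1-node run."""
    t1 = rows[0][1]
    return [(p, t1 / t, t1 / (t * p)) for p, t in rows]

for name, rows in runs.items():
    for p, s, e in scaling(rows):
        print(f"{name} {p:4d} nodes  speedup {s:7.1f}  efficiency {e:6.1%}")
```

On these figures problem (B) reaches a speedup of about 22 on 64 nodes
and problem (C) about 55 on 128 nodes, so parallel efficiency falls as
nodes are added, which is consistent with communication cost growing
relative to computation.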




Summary 


A general-purpose solver for large, sparse, banded systems of equations
has been developed and tested on the CM5E. This tool can be applied to
any problem arising in finite element analysis, such as structures,
machinery and acoustic-structure interaction, among many others. For a
very large problem, a factorization and solution takes 10 to 12
minutes; solutions for additional right-hand sides take on the order of
one minute each.
PARALLEL LINEAR SOLVERS


(2)  MESSAGE  SENDING  AND  UPDATING 


Figure  2 


CM FORTRAN LIMITATIONS


- DISTRIBUTES DATA AMONG FOUR VECTOR PIPELINES AT COMPILE TIME.

- LOOP OPERATIONS ON SEGMENTS OF THE DECLARED VECTOR TAKE FIXED TIME.

- FLOP RATES AS LOW AS 0.5 MFLOPS FOR LONG VECTORS.


EX: PARALLEL ARRAY A (256)


- OPERATIONS ON FIRST 16 ELEMENTS USE ALL VECTOR PIPELINES AND
  USE SAME ELAPSED TIME AS OPERATING ON ALL 256 ELEMENTS.


Figure  3 

MATRIX DATA MODIFICATION FOR VECTOR EFFICIENCY UNDER CM FORTRAN

- ARRAY A (256) IS REDEFINED AS TWO-DIMENSIONAL VECTOR WITH
  AXIS 2 AS PARALLEL DIMENSION.

CHANGES - ARRAY A (256) -> ARRAY A (16, 16)

- FLOP RATES AS HIGH AS 40.0 MFLOPS FOR LONG VECTORS.


EX: PARALLEL ARRAY A (16, 16)


- OPERATION ON FIRST 16 ELEMENTS ALONG SECOND DIMENSION USES
  ALL FOUR VECTOR PIPELINES.


